
Journal of Chemical Information and Modeling

American Chemical Society (ACS)

Preprints posted in the last 30 days, ranked by how well they match Journal of Chemical Information and Modeling's content profile, based on 207 papers previously published here. The average preprint has a 0.22% match score for this journal, so anything above that is already an above-average fit.

1
Deciphering the Molecular Structure of the Type III Secretion System in Chlamydia trachomatis for Structure-Based Therapeutic Targeting

Panda, A.; Kapoor, J.; Rajagopal, R.; Kumar, S.; Bandyopadhyay, A.

2026-05-09 bioinformatics 10.64898/2026.05.06.723290 medRxiv
Top 0.1%
55.2%

Chlamydia trachomatis is an obligate intracellular Gram-negative pathogen responsible for sexually transmitted infections and trachoma in humans. Although antibiotics are generally effective against acute infections, persistent chlamydial forms often exhibit reduced susceptibility during chronic infection. Chlamydia relies on its type III secretion system (T3SS) to inject effector proteins into host cells, making T3SS proteins attractive targets for antivirulence therapeutics. In this study, we employed an integrated computational pipeline to model and assemble the C. trachomatis T3SS constituent proteins. Template-based modeling using crystallographic structures of homologs from other Gram-negative bacteria revealed a highly conserved structural architecture despite low sequence identity (18-46%). Stereochemical validation confirmed high model quality, with most T3SS proteins exhibiting favorable protein-protein interactions (PPIs). Since the activity of the T3SS complex relies on extensive PPIs, we targeted these PPIs as a promising approach to attenuate bacterial virulence. CdsN, the ATPase of the T3SS, functions as a hexamer; we targeted its dimerization interface. Structure-based virtual screening of compounds from the e-Drug3D and IMPPAT libraries against predicted hotspot residues and the identified druggable pocket at the CdsN dimeric interface, followed by ADMET screening, yielded three promising candidates: M Roflumilast (Drug ID: 1537), Elacestrant (Drug ID: 2081), and Tecovirimat (Drug ID: 1889). All three ligands formed thermodynamically stable complexes with the CdsN dimer, with Elacestrant demonstrating the most favourable binding free energy. This was also confirmed by a 100 ns molecular dynamics simulation. This study provides new insights into the molecular architecture of C. trachomatis T3SS and identifies M Roflumilast, Elacestrant, and Tecovirimat as potential drug candidates against chlamydial infection.

2
TwinSAR: An Adaptive Kernel-based Algorithm with logit-transformed Z-score Filtering for Chemical Twin Detection in Large-scale Virtual Screening

Haris Kulosmanovic, H.; Uguz, C.; DURDAGI, S.

2026-05-15 bioinformatics 10.64898/2026.05.12.724687 medRxiv
Top 0.1%
46.4%

Molecular similarity searching is a workhorse of cheminformatics, but the dominant Tanimoto/topological-fingerprint paradigm has well-known blind spots. It is highly sensitive to molecular size, suffers from steep activity cliffs, and frequently fails to retrieve scaffold-hopping bioisosteres. A complementary descriptor that has received comparatively little attention is global elemental composition. Despite the conceptual simplicity of comparing molecules by their elemental ratios, no widely deployed method exists for the statistically rigorous identification of "chemical twins" defined by stoichiometric proximity. We address this gap with TwinSAR (Stoichiometric Analysis and Retrieval), an adaptive kernel-based algorithm that combines three methodological innovations: (i) binary fingerprint blocking that partitions molecules by element-presence patterns and reduces the cost of all-pairs comparison from O(NM) to O(Σ_i n_i m_i), enabling million/billion-scale searches; (ii) a per-block adaptive radial basis function (RBF) kernel whose precision parameter is calibrated independently for each fingerprint block via the median heuristic, providing fair similarity comparison across chemical sub-spaces of vastly different density; and (iii) a logit-transformed Z-score filter that maps bounded RBF scores onto an unbounded scale, allowing high-similarity pairs to be prioritized relative to the empirical score distribution of their own fingerprint block. TwinSAR is offered in two operating modes: (i) a deterministic BULK mode for exact reproducibility; and (ii) a stochastic FAST mode that achieved a 3.29x wall-clock speed-up in the present benchmark while preserving similar unique-query and unique-target coverage.
Statistical validation showed that detected twin pairs are 12.7x more similar in absolute ratio space than block-matched random pairs (p < 0.001), while a column-permutation negative control returned a median of zero spurious twins across three independent permutations. A controlled benchmark further established that an 8-element representation (single-element heavy-atom ratios) is sensitivity-equivalent to a comprehensive 254-element representation while running 3.55x faster. As a case study, TwinSAR was deployed in an end-to-end virtual screening pipeline against the BCL-2 target protein, where it reduced a 327,071-compound commercial library to a focused panel of 390 candidates. The chemical interpretability of the retrieved twins is illustrated by their structural diversity around conserved heavy-atom skeletons. TwinSAR therefore provides a fast, conformation-free, and statistically principled prefilter that is fully orthogonal to topological fingerprints.
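
The three steps named in this abstract (element-presence blocking, a per-block RBF kernel set by the median heuristic, and a logit-transformed Z-score) can be illustrated with a minimal sketch. This is not the TwinSAR code: the function names, the choice gamma = 1 / median squared distance, and the clipping constant eps are all assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

def block_by_presence(ratios):
    """Group row indices by which elements are present (ratio > 0)."""
    blocks = defaultdict(list)
    for i, row in enumerate(ratios):
        blocks[tuple((row > 0).tolist())].append(i)
    return blocks

def twin_scores(queries, targets, eps=1e-6):
    """Return {(query_idx, target_idx): z}, where z is the logit-transformed
    RBF similarity standardized within its own element-presence block.
    Hypothetical helper; gamma and eps are illustrative assumptions."""
    q_blocks = block_by_presence(queries)
    t_blocks = block_by_presence(targets)
    scores = {}
    for key, q_idx in q_blocks.items():
        t_idx = t_blocks.get(key)
        if not t_idx:
            continue  # no target shares this element pattern: nothing to compare
        q, t = queries[q_idx], targets[t_idx]
        # Squared Euclidean distances in elemental-ratio space, within the block only
        d2 = ((q[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)
        med = np.median(d2)
        gamma = 1.0 / med if med > 0 else 1.0  # median heuristic (assumed form)
        k = np.clip(np.exp(-gamma * d2), eps, 1.0 - eps)  # keep the logit finite
        logit = np.log(k / (1.0 - k))
        sd = logit.std()
        z = (logit - logit.mean()) / sd if sd > 0 else np.zeros_like(logit)
        for a, qi in enumerate(q_idx):
            for b, ti in enumerate(t_idx):
                scores[(qi, ti)] = float(z[a, b])
    return scores
```

Only pairs that share an element-presence pattern are ever compared, which is where the reduction from all-pairs cost comes from in this sketch; a filter would then keep pairs whose z exceeds a chosen threshold.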

3
The genetically-encoded amino acids distribute non-randomly within a functionally-relevant chemical space

Brown, S. M.; Hervey, J.; Dean, S. N.; Vora, G. J.

2026-05-07 synthetic biology 10.64898/2026.05.06.723277 medRxiv
Top 0.1%
41.0%

The standard set of 20 genetically-encoded amino acids (C20) exhibits a statistically non-random distribution in primarily two structurally-relevant physicochemical properties: hydrophobicity and molecular volume, and to a lesser extent charge. It remains an open question, however, whether evolutionary pressures similarly optimized the same alphabet for the distribution of functionally-relevant properties, such as reactivity. In this study, we used semi-empirical quantum chemistry simulations to calculate the highest occupied molecular orbital and lowest unoccupied molecular orbital (HOMO-LUMO) gaps for 84 xeno amino acids and constructed 10 million random 20-mer amino acid alphabets to determine where C20 fits amongst this background. The HOMO-LUMO gap measurements demonstrated that C20, similar to hydrophobicity and volume, also exhibits a non-random distribution. However, unlike hydrophobicity and volume, this distribution is non-random across an unevenly broad range. The results expand upon previous theory and suggest HOMO-LUMO gap energies as one property synthetic biologists may consider when developing novel protein design tools or designing functional xeno amino acid alphabets.
Highlights:
- Life's amino acid alphabet is non-randomly distributed within an expanded computationally-generated chemistry space generated from large-scale quantum chemistry simulations.
- Amino acid alphabet coverage theory applies beyond structurally-relevant physicochemical descriptors to include functionally-relevant properties like reactivity as measured by frontier molecular orbitals.
- Findings here provide a theoretical framework to guide the design of novel proteins and development of synthetic amino acid alphabets.

4
Linobectide: a mathematical-chemistry modified black-hole algorithmic framework for ORF1p inhibitor design

GRIGORIADIS, I.

2026-05-08 biophysics 10.64898/2026.05.06.723314 medRxiv
Top 0.1%
37.6%

Computer-aided drug design for conditional biomolecular interfaces requires evaluation across more than one receptor structure, docking pose, or scalar score. LINE-1 ORF1p is treated here as a state-family interface target whose relevant behavior is distributed across receptor microstates, assembly-compatible contact neighborhoods, ligand conformers, and perturbation snapshots. This article presents Linobectide as a mathematical-chemistry CADD workflow centered on a modified black-hole algorithm (MBHA) for persistence-weighted prioritization of putative ORF1p inhibitor candidates. Each molecule is represented as a dossier containing standardized descriptors, docking annotations, interaction-class persistence vectors, finite-action stability traces, graph-localization summaries, SPECTRAL-SAR applicability-domain records, and rank-shift diagnostics. The revised analysis emphasizes numerical reporting endpoints: fixed run parameters, baseline comparators, ablation metrics, rank stability, regeneration fractions, protected-elite fractions, and reproducibility indices. Docking is used as an annotation layer rather than as a stand-alone proof of inhibition. The framework is therefore reported as a transparent computational prioritization protocol that generates testable hypotheses for future biochemical and cellular validation, not as experimental proof of ORF1p inhibition or therapeutic activity. Author summary: Drug-design workflows can become over-dependent on the best docking pose even when an interface target remains functional through alternative contact corridors. Linobectide addresses this issue by ranking candidates only after docking annotations are aggregated across receptor-state and perturbation conditions. The MBHA search promotes a candidate when interaction persistence, finite-action stability, graph localization, SPECTRAL-SAR coherence, applicability-domain support, and reproducibility checks are concordant.
The revision removes unsupported claims of performance advantage and replaces them with benchmarkable endpoints that can be compared with docking-only, consensus-docking, and ablated MBHA baselines. The SI Appendix is retained as a figure atlas for state-family construction, graph-localization diagnostics, docking provenance, consensus geometry, and comparative triage.

5
Antimicrobial peptide databases and prediction tools: Toward a standard evaluation framework

Cisterna Garcia, A.; Gonzalez Lopez, A. M.; Vozi, A.; Esteban, M. A.; Egli, A.; Jutzeler, C.; Palma, J.; Sanchez-Ferrer, A.; Botia, J. A.

2026-05-21 bioinformatics 10.64898/2026.05.19.726290 medRxiv
Top 0.1%
34.1%

Antimicrobial resistance (AMR) has a profound impact on animal and human health and is associated with substantial morbidity, mortality and public health costs. There is a clear need to develop novel, effective antibiotic agents, which can overcome the current AMR crisis. Antimicrobial peptides (AMPs) may offer such a solution and have attracted growing attention for their potential to combat AMR. In parallel, the growing availability of peptide sequences in public databases has stimulated the development of numerous machine learning and deep learning tools to predict antimicrobial activity computationally. However, it remains unclear how reliably these tools can be compared, as existing studies often rely on heterogeneous datasets and inconsistent evaluation protocols that may lead to data leakage and inflated performance estimates. This raises a central question: what evaluation criteria and benchmark resources are needed to enable fair, reproducible, and biologically meaningful assessment of AMP prediction tools? We address this question by focusing specifically on antibacterial peptides (ABPs). We first provide an overview of AMP databases relevant to antibacterial activity and compare their content, redundancy, and experimental metadata. We then critically assess existing computational tools for ABP prediction, highlighting key limitations related to dataset construction, affinity to certain sequences, data leakage, and inconsistent performance reporting. Based on these limitations, we propose a reference evaluation framework designed to improve comparability, reproducibility, and practical utility in ABP prediction. Finally, we provide targeted recommendations for AMP databases and future tool development to support more robust progress in the computational discovery of ABPs.

6
SuBMIT: A Software Toolkit for Facilitating Simulations of Coarse-Grained Structure-Based Models of Biomolecules.

Prakash, D. L.; Banerjee, A.; Gosavi, S.

2026-05-20 biophysics 10.64898/2026.05.18.725912 medRxiv
Top 0.1%
33.5%

Coarse-grained structure-based models (CG-SBMs; or Gō models) are simplified potential energy functions of biomolecules or biomolecular complexes that encode their structure. Molecular dynamics simulations of such SBMs have been successfully used to study long time-scale dynamics such as protein and RNA folding, and large conformational transitions of biomolecular complexes. SBMs have several advantages: (1) Their MD simulations are computationally inexpensive, making extensive sampling easily accessible to many researchers. (2) They are easy to modify and can be adapted for the specific biomolecular problem that needs to be investigated. However, the force-fields of SBMs are not usually included in commonly used biomolecular simulation packages resulting in a barrier to their use. Here, we present SuBMIT (Structure Based Models Input Toolkit; https://github.com/sglabncbs/submit), a toolkit for generating coarse-grained SBM input files for performing MD simulations with GROMACS and OpenMM/OpenSMOG. Simulations whose input files can be generated using the different flavors of CG-SBMs present in SuBMIT include the folding and conformational ensembles of proteins with intrinsically disordered regions, 3D-domain-swapping in proteins and the dynamics of RNA-protein assemblies (e.g., simple RNA viruses).

7
Stereochemistry-Aware Drug-Target Affinity Prediction

Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725200 medRxiv
Top 0.1%
33.2%

Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for the spatial organization of atoms, also known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, they can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.

8
Generative Chemistry Platform for Small Molecules Targeting RNA: A Case Study for Chemical Optimization

Allen, T. E. H.; Bonnet, M.; Khan, R. T.

2026-05-12 bioinformatics 10.64898/2026.05.08.723908 medRxiv
Top 0.1%
33.1%

We introduce the Serna Bio GenAI platform, a generative chemistry and multiparametric optimization platform for the design of RNA-targeting small molecules. Targeting RNA with small molecules has proven historically challenging but offers notable potential upsides, including access to unique mechanisms of action and the ability to target otherwise untargetable genes. We consider a major challenge here to be designing chemistry specific to RNA targeting. Molecular design is a valuable application of AI in drug discovery, but many publicly available models use training data focused on protein targeting, historically the best-explored modality in drug discovery. We showcase the difference and value in building a specifically RNA-targeting platform, comparing its performance to state-of-the-art public chemical generators and experimentally validating its chemical designs in comparison to chemistry designed by a human expert.

9
Does the sequence of a disordered protein encode small molecule binding paths?

Louet, A. A. B.; Hummer, G.; Vendruscolo, M.

2026-05-23 biophysics 10.64898/2026.05.20.726646 medRxiv
Top 0.1%
28.2%

Ligand binding to intrinsically disordered proteins resists description in terms of conventional binding pockets, yet it can be analysed as a dynamic process in which ligands move across transient surface interaction sites. Here we characterise a pathway-based representation in which ligand binding is described as a sequence of transitions between residue-defined microstates, enabling ligand-specific effects to be distinguished from intrinsic properties of the peptide conformational ensemble. Using all-atom molecular dynamics simulations of Aβ42 and the C-terminal region of α-synuclein in complex with chemically diverse small molecules, we construct transition matrices that encode ligand movement across the peptide surface and use Markov state models to identify dominant binding pathways and relative binding propensities. Pairwise enrichment-factor and AUC analyses reveal strong conservation of the highest-ranked pathways across chemically diverse ligands, with enrichment factors of 15-45 for the top-ranked states and AUC values typically ≥0.75, well above random expectation. These dominant pathways are also preserved across changes in pH and temperature, whereas a urea control, included as a non-specific binder, shows reduced enrichment, indicating that ligands primarily modulate pathway weights rather than define the underlying network topology. Ensemble docking across chemically diverse libraries further supports the presence of recurrent ligand-accessible hotspots within the peptide conformational ensemble. Building on this framework, we apply a prospective screening pipeline to Aβ42, combining MSM-derived hotspots with sequence-based Ligand-Transformer scoring and Gnina docking across 1.66 million compounds, to nominate 19 candidates for prospective experimental evaluation. Together, these results indicate that disordered protein sequences give rise to conformational ensembles that exhibit characteristic binding pathways for small molecules.
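
As a rough illustration of the transition matrices mentioned in this abstract, the sketch below counts transitions in a discretized trajectory of microstate indices and row-normalizes them. It is a generic maximum-likelihood estimate, not the authors' pipeline: the function name, the integer state encoding, and the fixed lag are assumptions, and a real Markov state model would also need lag-time validation and typically a reversibility constraint.

```python
import numpy as np

def transition_matrix(traj, n_states, lag=1):
    """Row-stochastic transition matrix from a sequence of microstate indices.

    traj     : sequence of ints in [0, n_states), one per frame (hypothetical encoding,
               e.g. the residue-defined site the ligand contacts in that frame)
    lag      : number of frames between the start and end of each counted transition
    """
    counts = np.zeros((n_states, n_states))
    for i, j in zip(traj[:-lag], traj[lag:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # States that are never left keep an all-zero row instead of dividing by zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Each entry T[i, j] then estimates the probability that the ligand sits at microstate j one lag step after being at microstate i.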

10
Structural bias in machine learning-guided peptide design

Aldas-Bulos, V. D.; Plisson, F.

2026-05-08 bioinformatics 10.64898/2026.05.06.721805 medRxiv
Top 0.1%
28.1%

Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RoseTTAFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.

11
Computational Design of Novel Selective Phosphodiesterase 4B Inhibitors from Natural Products: An Integrated Machine Learning and Structure-Based Drug Discovery Approach

Oni, S. A.; Oyemomi, M. D.; Osho, A.; Abdulfatai, A.

2026-05-19 bioinformatics 10.64898/2026.05.16.725619 medRxiv
Top 0.2%
23.7%

Selective inhibition of phosphodiesterase 4B (PDE4B) remains a promising strategy for preserving the anti-inflammatory benefit of PDE4 inhibition in chronic obstructive pulmonary disease while reducing PDE4D-associated tolerability liabilities. This study integrated SHAP-interpretable machine learning, natural product virtual screening, hierarchical docking, post-docking MM-GBSA, isoform cross-docking, binding-pocket comparison, ADMET prediction, and 100 ns molecular dynamics simulations to identify PDE4B-selective inhibitors from the LOTUS natural product database. A Random Forest classifier trained on curated ChEMBL PDE4B bioactivity data achieved external performance of AUC-ROC = 0.955, accuracy = 0.893, F1-score = 0.896, and MCC = 0.785, and prioritized 119,698 predicted actives from 276,518 LOTUS compounds. SHAP analysis identified BertzCT and TPSA as major contributors to predicted activity. Sequential Lipinski, PAINS, and QED filtering retained 14,210 candidates for structure-based evaluation. Extra precision docking identified four leads with PDE4B docking scores of -9.123 to -12.080 kcal/mol, all outperforming roflumilast (-7.658 kcal/mol). Cross-docking and post-docking MM-GBSA supported preferential PDE4B binding for three candidates. The top lead, LTS0048837, maintained a stable PDE4B-bound pose during simulation, with comparatively stronger interaction persistence than its PDE4D complex and the roflumilast reference. These findings nominate LTS0048837 as a computationally prioritized PDE4B-selective natural product lead requiring experimental enzyme, cellular, and pharmacokinetic validation.

12
Benchmarking generative AI and physics based molecular simulation for sampling conformational heterogeneity in T4 Lysozyme

Bhakat, S.

2026-05-13 biophysics 10.64898/2026.05.10.724101 medRxiv
Top 0.2%
23.6%

Wild-type T4 lysozyme (T4L) is used as a benchmark to evaluate conformational sampling across generative AI, AI-accelerated molecular simulation (AMS), and physics-based enhanced molecular dynamics (EMD). A four-state model (exposed/open, exposed/closed, buried/open, and buried/closed) is defined using physically meaningful collective variables. While generative AI methods (AF-cluster, MSA subsampling of AlphaFold2, ConforFold, AlphaFlow, ESMFlow, ConfRover, BioEmu) largely sample only the exposed/open state, AMS, which integrates generative ensembles with iterative molecular dynamics, recovers all states and reproduces equilibrium populations similar to EMD and experimental smFRET signatures.

13
On the applicability domain of HADDOCK3 for protein-aptamer docking: documented failure modes from a 5x7 cross-target screening matrix and a 1676 aa receptor case study (P01031)

Dohi, E.

2026-05-12 bioinformatics 10.64898/2026.05.11.724398 medRxiv
Top 0.2%
23.0%

We screened a 5 receptor x 7 aptamer = 35-cell cross-target matrix with HADDOCK3 [1] under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The screen surfaced 12 operationally distinct failure modes (collapsing to ~8 conceptual classes; §3.1). The K_D-calibration subset is n = 4 cells with literature K_D records under matched assay conditions; the broader cohort includes ≥ 6 biological cognate or intended-cognate cells. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all 7 panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is the first reported case study of a 1676 aa multi-domain receptor exhibiting this signature under blind scale-adaptive AIR -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability domain concept [14-16] to in silico aptamer screening. §3.7 reports an empirical Mode 1 mitigation (pLDDT-aware AIR prefilter; cohort Jaccard recovery ~10x).

14
Reparameterization of the Amber RNA Force Field Non-Bonded Terms

Puthenpeedikakkal, A. M. K.; Cavender, C. E.; Smith, L. G.; Grossfield, A.; Mathews, D.

2026-05-19 biochemistry 10.64898/2026.05.18.725894 medRxiv
Top 0.2%
22.7%

All-atom simulations of RNA using molecular dynamics have the promise of modeling conformational preferences, folding thermodynamics, conformational change kinetics, and binding affinities of small molecule therapeutics. These simulations rely on a force field, a set of equations and parameters that model the potential energy as a function of conformation using classical mechanics. One popular force field for RNA is Amber OL3, with the most recent iteration derived in 1999 and with subsequent updates to backbone dihedral parameters. The Amber force field, while frequently used, is known to have limitations; for example, it does not properly stabilize native structures against alternative structures. Here, we provide a new approach to fitting the non-bonded parameters for the force field, specifically atom-centered point charges for electrostatics and the Lennard-Jones parameters. The parameters are fit to quantum mechanics (QM) interaction energies calculated with symmetry-adapted perturbation theory (SAPT), including embedded point charges to represent the electrostatic field from solvent and adjacent nucleotides. In this pilot study with a limited set of fitting data, we use the Amber ff99 equations and atom types unchanged. With the revised parameters, we observe improvement in the stability of native structures relative to alternative structures. Native tetraloop conformations, which unfold with the Amber OL3 force field, are stable on the microsecond timescale with our new force field parameters. We also see improvement in the conformational preferences of tetramers. Crucially, A-form helices are still well-modeled, but we observe additional flexibility in an internal loop that is not consistent with NMR data. Overall, we provide evidence that this new approach to fitting RNA force field parameters to SAPT interaction energies with native-structure context represented as embedded point charges is promising. 
It offers a flexible solution for revising the equations in future work or for extension to other molecules that interact with RNA, such as proteins and small molecules. We call this new set of force field parameters Amber RNA.ROC26.

15
Advancing in silico drug design with Bayesian refinement of AlphaFold models

Sen, S.; Hoff, S. E.; Morozova, T. I.; Schnapka, V.; Bonomi, M.

2026-05-06 bioinformatics 10.1101/2025.06.25.661454 medRxiv
Top 0.2%
22.6%

Virtual screening has become an indispensable tool in modern structure-based drug discovery, enabling the identification of candidate molecules by computationally evaluating their potential to bind target proteins. The accuracy of such screenings critically depends on the quality of the target structures employed. Recent advances in protein structure prediction, particularly AlphaFold2, have revolutionized this field with unprecedented accuracy. However, AlphaFold2 models often exhibit limitations in local structural details, especially within binding pockets, which limit their utility for small molecule docking. In contrast, molecular dynamics simulations with accurate atomistic force fields can refine protein structures, but lack the ability to leverage the structural information provided by deep learning approaches. Here, we introduce bAIes, an integrative method that bridges this gap by combining physics-based force fields with data-driven predictions through Bayesian inference. Crucially, bAIes demonstrates a superior ability to discriminate between binders and non-binders in virtual screening campaigns, outperforming both AlphaFold2 and molecular dynamics-refined models. By enhancing the usability of AlphaFold2 models without requiring extensive experimental or computational resources, bAIes offers a convenient solution to a longstanding challenge in structure-based drug design, potentially accelerating the early phases of drug discovery.

16
Pathway Representation via Intrinsic Structural Medoids (PRISM): A Structural Mapping Approach to Clustering Molecular Pathways

Brylle Woody Santos, J.; Leung, J.; Chong, L.; Miranda Quintana, R. A.

2026-05-19 biophysics 10.64898/2026.05.16.725628 medRxiv
Top 0.2%
22.5%

We present Pathway Representation via Intrinsic Structural Medoids (PRISM), a state-aware framework for clustering pathways from molecular dynamics simulations of biomolecular transitions. In PRISM, each pathway is mapped to a small set of structural medoids obtained via a deterministic k-means clustering scheme. Pairwise pathway dissimilarities are computed using a weighted average Hausdorff distance between these representative sets, effectively capturing mean nearest-neighbor structural deviations while reducing sensitivity to outliers. Hierarchical agglomerative clustering of the resulting dissimilarity matrix defines pathway families. We evaluate PRISM across three biomolecular transitions of increasing complexity: alanine dipeptide C7eq → C7ax isomerization, adenylate kinase opening, and HIF-2 PAS-B ligand unbinding. PRISM consistently yields robust cluster assignments, with medoids faithfully representing distinct conformational states. By combining a state-based description with robust geometric dissimilarities, PRISM provides a scalable framework for organizing complex transition pathways.
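
The weighted average Hausdorff distance named in this abstract can be sketched as follows. This is one common symmetric form, not necessarily the exact PRISM definition: the function name, the optional weights (for example, cluster populations), and the use of Euclidean distance between feature vectors in place of a structural metric such as RMSD are all assumptions.

```python
import numpy as np

def weighted_avg_hausdorff(a, b, wa=None, wb=None):
    """Symmetric weighted average Hausdorff distance between two point sets.

    a, b   : arrays of shape (n_a, d) and (n_b, d), e.g. medoid feature vectors
    wa, wb : optional non-negative weights per medoid (hypothetical: cluster sizes)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    wa = np.ones(len(a)) if wa is None else np.asarray(wa, dtype=float)
    wb = np.ones(len(b)) if wb is None else np.asarray(wb, dtype=float)
    # Pairwise Euclidean distances between every medoid in a and every medoid in b
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = np.average(d.min(axis=1), weights=wa)  # mean nearest-neighbor distance a -> b
    b_to_a = np.average(d.min(axis=0), weights=wb)  # mean nearest-neighbor distance b -> a
    return 0.5 * (a_to_b + b_to_a)
```

Averaging nearest-neighbor distances, rather than taking their maximum as the classical Hausdorff distance does, is what limits the influence of a single outlying medoid. The resulting pairwise matrix could then be passed to any hierarchical agglomerative clustering routine.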

17
Pharmacological proximities in the GPCR family discovered using contact-informed amino-acid and binding pocket similarities

So, S. S.; Ngo, T.; Ilatovskiy, A. V.; Finch, A. M.; Riek, R. P.; Abagyan, R.; Smith, N. J.; Kufareva, I.

2026-05-06 bioinformatics 10.64898/2026.05.02.720972 medRxiv
Top 0.2%
22.0%

Understanding protein proximities in the theoretical ligand space is essential for developing therapeutics with desirable polypharmacology, predicting off-targets, and discovering surrogate ligands for poorly characterized proteins. This is especially important for G protein-coupled receptors (GPCRs) - a major class of drug targets, many of which still lack known ligands. Circumventing this limitation, we present GPCR-CoINPocket v2, a contact-informed metric for detecting GPCR pharmacological similarities from amino-acid sequences alone. We first establish a "gold standard" of pharmacological relatedness using ChEMBL-derived ligand sets. We then replace traditional evolutionary amino acid similarity matrices with a chemically-informed matrix derived from protein:ligand interaction patterns across 3,306 structures, significantly improving early detection of shared pharmacology between distantly homologous receptors. An additional unconstrained, contact-informed matrix further enhances predictive performance. Pilot application of the method revealed previously unrecognized similarities between the β2 adrenoceptor and three Class A peptide GPCRs, which we confirmed experimentally by demonstrating the binding of select ligands of these receptors to the β2. Dimensionality reduction of similarity scores recapitulates known receptor relationships and predicts neighbors of orphan GPCRs later confirmed experimentally. Overall, GPCR-CoINPocket v2 provides a powerful sequence-based framework to prioritize ligand space, predict polypharmacology, and accelerate GPCR drug discovery and deorphanization.

18
Highly Accurate Estimation of the Fold Accuracy of Protein Structural Models

Xie, L.; Ye, E.; Wang, H.; Zhang, T.; Zhen, Q.; Liang, F.; Liu, D.; Zhang, G.

2026-05-13 bioinformatics 10.64898/2026.04.15.718808 medRxiv
Top 0.2%
19.3%

Background: The function of a protein is intrinsically linked to its three-dimensional fold, and deep learning has revolutionized the field by enabling high-accuracy structure prediction at an unprecedented scale. Nevertheless, the growing deployment of these predictive pipelines in drug discovery and structural biology reveals a critical bottleneck: the lack of independent and rigorous methods for estimation of model accuracy (EMA). Results: Here we present DeepUMQA-Global, a single-model deep learning framework for estimating the accuracy of protein structure models. Our method employs a structure-sequence cross-consistency mechanism to evaluate the bidirectional compatibility between the predicted structure and the input sequence, enabling comprehensive characterization of fold accuracy. DeepUMQA-Global outperforms the self-assessment confidence scores of AlphaFold3, achieving improvements of 57.8% in Pearson correlation and 49.0% in Spearman correlation. On the CASP16 retrospective benchmark, DeepUMQA-Global outperforms all single-model accuracy estimation methods that participated in CASP16 and achieves performance comparable to that of the top consensus-based methods. A lightweight consensus strategy built upon DeepUMQA-Global ranks first among all CASP16 participants, surpassing all other methods, including consensus approaches, and highlighting the strength of our method. Remarkably, DeepUMQA-Global demonstrates a strong ability to discriminate between alternative conformational states of proteins, as evidenced in the CASP unique alternative conformation protein complex target and the CoDNaS benchmark. Conclusions: Our results indicate that DeepUMQA-Global can be extended to broader protein modeling tasks, moving beyond static evaluation to offer a foundation for dynamic conformation EMA, where it accurately discriminates alternative conformational states and demonstrates reliable predictive fidelity in model accuracy estimation.
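
The Pearson and Spearman figures quoted above compare predicted accuracy scores against reference scores across a set of models. As a rough sketch of how such an evaluation is commonly computed (not the authors' benchmark code; the function name and the toy scores are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def evaluate_ema(predicted, reference):
    """Correlate predicted model-accuracy scores with reference scores.

    predicted : estimated global accuracy per structural model
    reference : accuracy measured against the experimental structure
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    r, _ = pearsonr(predicted, reference)     # linear agreement
    rho, _ = spearmanr(predicted, reference)  # rank (ordering) agreement
    return {"pearson": r, "spearman": rho}


# Hypothetical scores for five models of one target.
scores = evaluate_ema([0.31, 0.55, 0.62, 0.80, 0.91],
                      [0.25, 0.50, 0.70, 0.78, 0.95])
```

Pearson rewards predictions that track the reference values linearly, while Spearman only asks that models are ranked in the right order, which is why EMA papers usually report both.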

19
Bridging LLM Reasoning and Chemical Knowledge via an Evolutionary Multi-Agent Framework for Molecular Synthesis

Chen, Y.; Rao, J.; Xie, J.; Sun, Y.; Yang, Y.

2026-05-06 bioinformatics 10.64898/2026.05.02.722342 medRxiv
Top 0.3%
19.0%

Motivation: Molecular design faces the dual challenge of navigating a vast chemical space while ensuring experimental synthesizability. Traditional models are constrained by small datasets, restricting their scalability and broader chemical context. In contrast, Large Language Models (LLMs) encapsulate extensive synthesis protocols derived from vast scientific literature, yet they struggle to leverage this potential due to severe hallucinations and a superficial grasp of rigorous chemical logic. Results: We propose EvoSyn, an evolutionary multi-agent framework that synergizes LLM reasoning with domain experts for preference-aware molecular synthesis. EvoSyn orchestrates a dual-process evolutionary paradigm: a co-evolving process that collaboratively aligns linguistic capabilities with multi-objective constraints, and a self-evolving process formulated as a Markov Game. Through evolution and reinforcement learning, agents actively learn from mistakes, utilizing domain feedback to penalize invalid proposals and ground generation in feasible reaction pathways. Extensive evaluations on comprehensive benchmarks demonstrate that EvoSyn significantly outperforms state-of-the-art baselines. These results highlight that by integrating LLM-guided self-evolution with rigorous domain validation to mitigate hallucinations, EvoSyn effectively yields molecules that are both bioactive and synthetically actionable. Availability and implementation: Implementation code is available as supplementary material. Contact: yangyd25@mail.sysu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.

20
Therapeutic Relevance of NLPA Lipoprotein to Combat Biofilm-Associated Infection in Acinetobacter baumannii

Brahma, V. U.; Munagalasetty, S.; Bhandari, V.

2026-05-20 bioinformatics 10.64898/2026.05.18.725845 medRxiv
Top 0.3%
18.7%

Acinetobacter baumannii is a leading multidrug-resistant, critical-priority pathogen in healthcare settings, where biofilm formation confers survival and antibiotic tolerance. Targeting virulence-associated proteins offers an alternative to conventional bactericidal strategies. Here, the inner-membrane-anchored lipoprotein NLPA, implicated in biofilm-associated adaptation, was studied as a putative anti-virulence target using an integrated in silico pipeline, with in vitro experiments complementing the computational findings. The AlphaFold-derived structure of NLPA served as the basis for virtual screening of approximately 1.6 million compounds, with subsequent prioritization guided by MM/GBSA-calculated binding free energies to highlight the most promising candidates. Molecular dynamics simulations demonstrated stable NLPA-ligand complexes, as indicated by equilibrated RMSD, low residue fluctuations in the binding region, and persistent interaction networks over time. Pharmacokinetic evaluation indicated that the compounds satisfied Lipinski's Rule of Five and had overall acceptable ADMET characteristics. Two compounds, NLPA-6 and NLPA-3, showed the most favourable predicted binding free energies, suggesting strong and stable interactions within the NLPA binding site. NLPA-3 was evaluated in vitro against A. baumannii to validate the computational outcomes. The compound displayed moderate antibacterial activity with a MIC of 125 µg/mL and demonstrated 55.75% inhibition of biofilm formation at 4× MIC. In addition, in macrophage infection studies, NLPA-3 decreased intracellular bacterial survival to 19.25% at 50 µg/mL, suggesting that it may disrupt virulence pathways linked to persistence. Overall, these findings identify promising NLPA-targeting compounds and support the feasibility of NLPA as an anti-virulence target.
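
For readers unfamiliar with the Lipinski filter mentioned above, the check itself is simple arithmetic on four molecular descriptors. The sketch below shows the standard thresholds; the class, function names, and example descriptor values are hypothetical and are not taken from the preprint, and allowing one violation is a common convention rather than something the abstract states.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Pass if the number of violations does not exceed max_violations (assumed default 1)."""
    return lipinski_violations(d) <= max_violations


# Hypothetical descriptor values, for illustration only.
candidate = Descriptors(mol_weight=412.5, log_p=3.2,
                        h_bond_donors=2, h_bond_acceptors=6)
ok = passes_rule_of_five(candidate)
```

In practice the descriptors would come from a cheminformatics toolkit; they are passed in directly here so the example stays self-contained.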